Download auto data from the Statistical Learning book website here: http://www-bcf.usc.edu/~gareth/ISL/data.html
Today, we are going over Hierarchical clustering, K-Means Clustering, PCA, and ICA.
# read in Auto data
Auto_data <- read_csv("/Users/ay1392/Desktop/Auto.csv")
## Parsed with column specification:
## cols(
## mpg = col_double(),
## cylinders = col_double(),
## displacement = col_double(),
## horsepower = col_character(),
## weight = col_double(),
## acceleration = col_double(),
## year = col_double(),
## origin = col_double(),
## name = col_character()
## )
#remove cars with unknown horsepower and set horsepower to numeric
Auto_data <- Auto_data %>%
filter(horsepower != "?") %>%
mutate(horsepower = as.numeric(horsepower)) %>%
as.data.frame()
#save car names
Auto_data_names <- Auto_data$name
#data to cluster
Auto_data_clust <- Auto_data[,1:8]
dim(Auto_data_clust)
## [1] 392 8
#392 is too much for a demo, so lets take the first 25
Auto_data_clust <- Auto_data_clust[1:25,]
rownames(Auto_data_clust) <- Auto_data_names[1:25]
Step 1. Assign each item to it’s own cluster. We start with 25 clusters, one for each car.
Step 2. Calculate a proximity matrix between each cluster.
Step 3. Find the pair of clusters closest to each other.
Step 4. Merge these clusters and then recalculate similarity between clusters. Some options are: single linkage (distance is calculated from the nearest neighbors), complete linkage (distance is calculated from furthest neighbor), average linkage (distance is calculated from mean of different clusters).
Step 5. Repeat Step 3 and 4 until there is only one cluster.
Step 1. Each car is a cluster.
Step 2. Create a distance matrix from Auto_data_clust.
help("dist")
hierarchical_dist <- as.matrix(dist(Auto_data_clust, method = "euclidean"))
#View(hierarchical_dist)
Step 3. Find the two cars that are the most similar to each other and print the names of those two cars
diag(hierarchical_dist) <- NA
arrayInd(which.min(hierarchical_dist), dim(hierarchical_dist))
## [,1] [,2]
## [1,] 23 15
#postitions 23 and 15 are the most similar. Lets go back to the names of the cars
Auto_data_names[23]
## [1] "saab 99e"
Auto_data_names[15]
## [1] "toyota corona mark ii"
Step 4. Merge the two clusters together using average linkage.
#replace pos 15 with the average of pos 15 and 23
hierarchical_dist[,15] <- apply((hierarchical_dist[,c(23,15)]),1,mean)
hierarchical_dist[15,] <- apply((hierarchical_dist[c(23,15),]),2,mean)
#remove pos 23
hierarchical_dist <- hierarchical_dist[-23,-23]
#now position 15 represents the cluster containing the saab99e and the toyota corona mark ii
Step 5. To complete the algorithm, go back to step 3 and iterate through all of the previous steps until there are no more rows left
diag(hierarchical_dist) <- NA
arrayInd(which.min(hierarchical_dist), dim(hierarchical_dist))
## [,1] [,2]
## [1,] 4 3
#postitions 4 and 3 are the most similar
Auto_data_names[4]
## [1] "amc rebel sst"
Auto_data_names[3]
## [1] "plymouth satellite"
Now that we know how the algorithm works, let’s use the R function hclust. Plot the Dendogram resulting from clustering the Auto_data_clust using average linkage.
hierarchical_dist <- dist(Auto_data_clust, method = "euclidean")
tree <- hclust(hierarchical_dist, method="average")
plot(tree)
There is one more element to hierarchical clustering: Cutting the tree. Here, we can control how many clusters we want or the height of the tree.
#help(cutree)
# cut tree into 3 clusters
tree <- hclust(hierarchical_dist, method="average")
plot(tree)
tree_k2 <- cutree(tree, k = 2)
# plot the tree before running this line
rect.hclust(tree, k = 3, h = NULL)
Principal Components Analysis is a linear dimensionality reduction algorithm. If you want to learn more about linear algebra, I suggest the MIT Open Courseware class here : https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/ There are two ways of doing PCA, Single Value Decomposition (SVD), and the method we will use today, using the covariance matrix of the data.
Step 1. Center data by subtracting the mean.
Step 2. Calculate covariance matrix of data.
Step 3. Perform Eigendecomposition of the covariance matrix. i.e. represent the matrix in terms of it’s eigenvalues and eigen vectors
Step 4. Multiply the eigen vectors by the original data to express the data in terms of the eigen vectors.
Step 1. Center the data by subtracting the mean of the each column from the values in that column
Auto_data_clust_pca <- data.matrix(Auto_data_clust)
Center_auto <- apply(Auto_data_clust_pca, 2, function(x) x - mean(x))
Step 2. Calculate covariance matrix of the Auto data
Covariance_auto <- cov(Center_auto)
Step 3. Calculate eigen values and vectors
Eigen_value_auto <- eigen(Covariance_auto)$value
#columns are the eigen vectors
Eigen_vector_auto <- eigen(Covariance_auto)$vector
Step 4. Multiply the eigen vector matrix by the original data.
PC <- as.data.frame(data.matrix(Center_auto) %*% Eigen_vector_auto)
ggplot(PC, aes(PC[,1], PC[,2])) + geom_point(aes(PC[,1], PC[,2]))
#+ geom_text(aes(label=Auto_data_names[1:8]), nudge_x = -2.5, nudge_y = 400)
Step 5. Find out which principal components explain the variance in the data.
#for each component, take the cumulative sum of eigen values up to that point and and divide by the total sum of eigen values
round(cumsum(Eigen_value_auto)/sum(Eigen_value_auto) * 100, digits = 2)
## [1] 99.52 99.95 100.00 100.00 100.00 100.00 100.00 100.00
Principal component 1 and 2 explain 99.99 percent of the variance. Principal component 1,2, and 3 together explain 100% of the variance in the data.
Now that we know how PCA works, lets use the R funtion prcomp.
help("prcomp")
autoplot(prcomp(Auto_data_clust_pca))
ICA is an algorithm that finds components that are independent, subcomponents of the data.
Step 1. Whiten the data by projecting the data onto the eigen vectors (PCA).
Step 2. Solve the X=AS equation by maximizing non-gaussianty in the variables(components) in S.
This results in a matrix S with components that are independent from each other.
We will use the fastICA algorithm.
First we will go backwards. Create a matrix S with the independent components
#create two signals
S <- cbind(cos((1:500)/10), ((500:1)/1000))
par(mfcol = c(1, 2))
plot(S[,1], type="l")
plot(S[,2], type="l")
Create a mixing matrix A
A <- matrix(c(0.5, 0.9, 0.423, 0.857), 2, 2)
Mix S using A
X <- S %*% A
par(mfcol = c(1, 2))
plot(X[,1], type="l")
plot(X[,2], type="l")
Unmix using fastICA
par(mfcol = c(1, 2))
plot(1:500, a$S[,1], type = "l", xlab = "S'1", ylab = "")
plot(1:500, a$S[,2], type = "l", xlab = "S'2", ylab = "")
plot the independent components as a heatmap
heatmap(a$S)
data(iris)
Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.data <- iris %>%
select(c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width))
flower_names <- iris %>%
as_tibble() %>%
select(Species)
Step 1. Choose the K number of clusters.
Step 2. Split the data randomly into K groups.
Step 3. Iterate until the cluster assignments stop changing:
K = 2
set.seed(10)
# Randomly assign a number, from 1 to K, to each of the observations.
idx <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.5, 0.5))
K1_data <- data[idx, ]
K1_names <- flower_names[idx,]
K2_data <- data[!idx, ]
K2_names <- flower_names[!idx,]
pcs<- as.data.frame(prcomp(data)$x)
pcs$Cluster1 <- idx
(current_clust <- ggplot(pcs, aes(PC1, PC2, color = Cluster1)) +
geom_point() +
theme_bw() +
ggsci::scale_color_aaas())
# create a list to keep track of the data and calculate centroid
K1 <- list(observations = K1_data,
names = K1_names,
centroid = colMeans(K1_data))
K2 <- list(observations = K2_data,
names = K2_names,
centroid = colMeans(K2_data))
# Assign each observation to the cluster whose centroid is the closest.
# lets define a clean distance function so we know what is going on
sq_euclidean_dist <- function(x1, x2) {
sum((x1 - x2) ^ 2)
}
# now calculate the squared euclidean distance between each observation adn the cluster centroid
dist_K1 <- apply(data, 1, function(x){ sq_euclidean_dist(x, K1$centroid) })
dist_K2 <- apply(data, 1, function(x){ sq_euclidean_dist(x, K2$centroid) })
dist_clust <- cbind(dist_K1, dist_K2)
# assign each observation to the closest cluster
idx <- apply(dist_clust, 1, function(x){ which.min(x) == 1})
K1_data <- data[idx, ]
K1_names <- flower_names[idx,]
K2_data <- data[!idx, ]
K2_names <- flower_names[!idx,]
# create a list to keep track of the data and calculate centroid
K1 <- list(observations = K1_data,
names = K1_names,
centroid = colMeans(K1_data))
K2 <- list(observations = K2_data,
names = K2_names,
centroid = colMeans(K2_data))
pcs<- as.data.frame(prcomp(data)$x)
pcs$Cluster1 <- idx
(current_clust <- ggplot(pcs, aes(PC1, PC2, color = Cluster1)) +
geom_point() +
theme_bw())
# now calculate the squared euclidean distance between each observation adn the cluster centroid
dist_K1 <- apply(data, 1, function(x){ sq_euclidean_dist(x, K1$centroid) })
dist_K2 <- apply(data, 1, function(x){ sq_euclidean_dist(x, K2$centroid) })
dist_clust <- cbind(dist_K1, dist_K2)
# assign each observation to the closest cluster
idx <- apply(dist_clust, 1, function(x){ which.min(x) == 1})
K1_data <- data[idx, ]
K1_names <- flower_names[idx,]
K2_data <- data[!idx, ]
K2_names <- flower_names[!idx,]
# create a list to keep track of the data and calculate centroid
K1 <- list(observations = K1_data,
names = K1_names,
centroid = colMeans(K1_data))
K2 <- list(observations = K2_data,
names = K2_names,
centroid = colMeans(K2_data))
pcs<- as.data.frame(prcomp(data)$x)
pcs$Cluster1 <- idx
(current_clust <- ggplot(pcs, aes(PC1, PC2, color = Cluster1)) +
geom_point() +
theme_bw())
autoplot(prcomp(data))
a <- fastICA(data, 3, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = FALSE)
plot the independent components as a heatmap
heatmap(a$S)
plot(cluster::silhouette(kmeans(as.matrix(data), 3)$cluster, dist = dist(data)))
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_nbclust(as.matrix(data), kmeans, method = "silhouette")
kmeans_2 <- kmeans(as.matrix(data),2)
ggplot(prcomp(data)$x, aes(PC1, PC2, color = flower_names$Species)) +
geom_point() +
theme_bw() +
ggsci::scale_color_jama(name = "Species") +
ggtitle("Species")
pc_data <- prcomp(data)$x
ggplot(pc_data, aes(PC1, PC2, color = as.character(kmeans_2$cluster))) +
geom_point() +
theme_bw() +
ggsci::scale_color_jco(name = "clusters K=2") +
ggtitle("Clusters K = 2")
Linkage: Average
Distance: Euclidean
Cut: h = 4
tree_hw <- hclust(dist(data), method = "average")
plot(tree_hw)
tree_hw_cut <- cutree(tree_hw, k = NULL, h = 4)
rect.hclust(tree_hw, k = NULL, h = 4)
ggplot(pc_data, aes(PC1, PC2, color = as.character(tree_hw_cut))) +
geom_point() +
theme_bw() +
ggsci::scale_color_jco(name = "Linkage: Average
Distance: Euclidean
Cut: h = 4")
Linkage: Average
Distance: Euclidean
Cut: k = 5
tree_hw <- hclust(dist(data), method = "average")
plot(tree_hw)
tree_hw_cut <- cutree(tree_hw, k = 5, h = NULL)
rect.hclust(tree_hw, k = 5, h = NULL)
ggplot(pc_data, aes(PC1, PC2, color = as.character(tree_hw_cut))) +
geom_point() +
theme_bw() +
ggsci::scale_color_jco(name = "Linkage: Average
Distance: Euclidean
Cut: k = 5")
Linkage: Complete
Distance: Euclidean
Cut: k = 4
tree_hw <- hclust(dist(data), method = "complete")
plot(tree_hw)
tree_hw_cut <- cutree(tree_hw, h = NULL, k = 4)
rect.hclust(tree_hw, h = NULL, k = 4)
ggplot(pc_data, aes(PC1, PC2, color = as.character(tree_hw_cut))) +
geom_point() +
theme_bw() +
ggsci::scale_color_jco(name = "Linkage: Complete
Distance: Euclidean
Cut: k = 4")
Linkage: Single
Distance: Euclidean
Cut: k = 4
tree_hw <- hclust(dist(data), method = "single")
plot(tree_hw)
tree_hw_cut <- cutree(tree_hw, h = NULL, k = 4)
rect.hclust(tree_hw, h = NULL, k = 4)
ggplot(pc_data, aes(PC1, PC2, color = as.character(tree_hw_cut))) +
geom_point() +
theme_bw() +
ggsci::scale_color_jco(name = "Linkage: Single
Distance: Euclidean
Cut: k = 4")
Linkage: Single
Distance: Manhattan
Cut: k = 4
tree_hw <- hclust(dist(data, method = "manhattan"), method = "single")
plot(tree_hw)
tree_hw_cut <- cutree(tree_hw, h = NULL, k = 4)
rect.hclust(tree_hw, h = NULL, k = 4)
ggplot(pc_data, aes(PC1, PC2, color = as.character(tree_hw_cut))) +
geom_point() +
theme_bw() +
ggsci::scale_color_jco(name = "Linkage: Single
Distance: Manhattan
Cut: k = 4")
Linkage: Single
Distance: Canberra
Cut: k = 4
tree_hw <- hclust(dist(data, method = "canberra"), method = "single")
plot(tree_hw)
tree_hw_cut <- cutree(tree_hw, h = NULL, k = 4)
rect.hclust(tree_hw, h = NULL, k = 4)
ggplot(pc_data, aes(PC1, PC2, color = as.character(tree_hw_cut))) +
geom_point() +
theme_bw() +
ggsci::scale_color_jco(name = "Linkage: Single
Distance: Canberra
Cut: k = 4")
On PCA:
Eigen Vectors and Eigen Values http://www.visiondummy.com/2014/03/eigenvalues-eigenvectors/ Linear Algebra by Prof. Gilbert Strang https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/ http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
On ICA:
Independent Component Analysis: Algorithms and Applications https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf Tutorial on ICA taken from http://rstudio-pubs-static.s3.amazonaws.com/93614_be30df613b2a4707b3e5a1a62f631d19.html